Abstract

Our goal is to compute a policy that guarantees improved return over a baseline
policy even when the available MDP model is inaccurate. The inaccurate model
may be constructed, for example, by system identification techniques when the
true model is inaccessible. When the modeling error is large, the standard
solution to the constructed model has no performance guarantees with respect to
the true model. In this paper, we develop algorithms that provide such
performance guarantees and show a trade-off between their complexity and
conservatism. Our novel model-based safe policy search algorithms leverage
recent advances in robust optimization techniques. Furthermore, we illustrate
the effectiveness of these algorithms using a numerical example.


1 Introduction 

Many problems in science and engineering can be formulated as sequential
decision-making under uncertainty. A common scenario in such problems, in areas
as different as online marketing, inventory control, health informatics, and
computational finance, is that we are given a batch of data generated by the
current strategy (or strategies) of the company (hospital, investor), and we are
asked to find a good or an optimal policy. Although there are many techniques
for finding a good policy given a batch of data, there are few results that
guarantee the obtained policy will perform well in the real system without
deploying it. Since deploying an untested policy may be risky and harmful for
the business, the product (hospital, investment) manager does not allow it
unless we provide her with some guarantee on the performance of the policy,
e.g., convince her that the policy performs at least as well as her existing
strategy.

Our focus is on a model-based approach to this fundamental problem. In this
approach, we first use the batch of data to build a simulator that mimics the
behavior of the dynamical system under study (online advertisement, inventory
system, emergency room of a hospital, financial market), together with an error
function that bounds its accuracy, and then use this simulator to generate data
and learn a (good) policy. The main challenge here is to have guarantees on the
performance of the learned policy, given the error in the simulator. This line
of research is closely related to the area of robust learning and control. What
makes our problem different from standard robust learning and control
[1, 6, 9, 22] is the existence of a baseline policy (e.g., the company's current
strategy), which often has reasonable performance. This difference allows us to
develop algorithms whose performance (the performance of their returned policy)
is better than that of the standard robust methods that optimize for the
worst-case scenario.

In this work, we assume that 1) the sequential decision-making problem can be
modeled as an infinite-horizon Markov decision process (MDP); 2) we are given a
simulator of this system together with an error function that bounds its
accuracy (we briefly discuss how the simulator and error function can be built
from the batch of data in Appendix A.1); 3) we are given a baseline policy for
the problem (e.g., the current strategy of the company); and 4) the performance
of the baseline policy (the baseline performance) is known (this is a reasonable
assumption, as the batch of data is often large enough to give an accurate
estimate of the performance of its generating strategy). Our goal is to find a
policy that is safe, i.e., performs at least as well as the baseline policy in
the real world. We present four algorithms to tackle this problem in
Sections 3.1 to 3.4.


For each algorithm, we prove that its returned policy is safe and provide a
bound on its performance loss w.r.t. an optimal policy of the real system. From
each proposed algorithm to the next, the computational complexity grows, but at
the same time, the chance of finding a safe policy other than the obvious
solution of the baseline policy also increases (the safe policy search becomes
less conservative). These major findings are summarized in Figure 1, whose
notation is defined in later sections. We show this change in the behavior of
our algorithms through a simple example in Section 4. This example also serves
as a proof of concept for the algorithms in Sections 3.1 to 3.4. Another
important difference between the algorithms in Sections 3.1 to 3.3 and the one
in Section 3.4 is that the latter works directly with the baseline policy, while
the former use it only indirectly, through the baseline performance.
Figure 1: An overview of various methods for finding safe policies. Here A/R stands
for acceptance/rejection of the offline test. The arrow illustrates the computational 
complexity/conservatism trade-off of different methods. 
2 Preliminaries

A $\gamma$-discounted MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, r, P, p_0, \gamma)$,
where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces;
$r(x,a) \in [-R_{\max}, R_{\max}]$ is the bounded reward function;
$P(\cdot \mid x,a)$ is the transition probability function; and $p_0(\cdot)$ is
the initial state distribution. A solution to MDP $\mathcal{M}$ is a stationary
policy $\pi$ that maps $\mathcal{M}$'s state space to a distribution over its
action space, i.e., $\pi : \mathcal{X} \times \mathcal{A} \to [0,1]$. We denote
by $\Pi_S$ the set of all such policies for MDP $\mathcal{M}$. We define the
performance of $\pi$ in the world modeled by MDP $\mathcal{M}$ as

$$\rho(\pi, \mathcal{M}) = p_0^\top V^\pi_{\mathcal{M}} = \mathbb{E}\left[ \sum_{t=0}^{T-1} \gamma^t\, r\big(X_t, \pi(X_t)\big) \,\Big|\, p_0 \right],$$

where $X_t$ is the random variable representing the state of the MDP
$\mathcal{M}$ at time-step $t$ and $V^\pi_{\mathcal{M}}$ is the value function of
policy $\pi$ in $\mathcal{M}$. We also define an optimal policy as

$$\pi^*_{\mathcal{M}} \in \arg\max_{\pi \in \Pi_S} \rho(\pi, \mathcal{M}).$$

As mentioned in Section 1, we use the historical data to build a simulator of
the system, together with an error function that measures its accuracy. We
denote by $\mathcal{M}^*$ and $\widehat{\mathcal{M}}$ the (unknown) true and
simulated MDPs with transition probability functions $P^*$ and $\widehat{P}$,
respectively.¹ In order to capture the deviation between these models, we make
the following assumption:

Assumption 1. For each $(x,a) \in \mathcal{X} \times \mathcal{A}$, the error
function $e(x,a)$ bounds the $L_1$-norm of the difference between the true and
estimated transition probabilities, i.e.,

$$\| P^*(\cdot \mid x,a) - \widehat{P}(\cdot \mid x,a) \|_1 \le e(x,a).$$



In many practical situations, the deviation between $P^*$ and $\widehat{P}$ is
bounded only with high probability. Nevertheless, here we restrict the error
bound to be deterministic to simplify the analysis in later sections. Extending
these results to probabilistic bounds is direct and omitted for brevity. More
information on building the simulator and computing the $L_1$-deviation bound
(using an empirical distribution for the simulator $\widehat{P}$ and the
Weissman distribution bound [22]) is available in Appendix A.1.


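Remark 1. Using the estimated transition probability function $\widehat{P}$ and
the error function $e$, we may construct the uncertainty set

$$\mathcal{U}(\widehat{P}, e) = \left\{ P : \| P(\cdot \mid x,a) - \widehat{P}(\cdot \mid x,a) \|_1 \le e(x,a),\ \forall (x,a) \in \mathcal{X} \times \mathcal{A} \right\}.$$

This uncertainty set automatically defines an uncertainty set
$\mathcal{U}(\widehat{\mathcal{M}}, e)$ for MDPs. Under Assumption 1, it is clear
that the true transition probability function $P^*$ (the true MDP
$\mathcal{M}^*$) belongs to the uncertainty set $\mathcal{U}(\widehat{P}, e)$.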

Given Assumption 1, we have the following upper bound on the difference between
the return of a policy $\pi$ in the true and simulated MDPs (with transition
probabilities $P^*$ and $\widehat{P}$, respectively).

¹ In this paper, we restrict our attention to error in the transition
probability function to simplify the exposition; the results readily extend to
the case with error in the reward function.
Lemma 2. Given Assumption 1, for any policy $\pi$, the difference between the
return of $\pi$ in the true and simulated MDPs is upper bounded as follows:

$$\left| \rho(\pi, \widehat{\mathcal{M}}) - \rho(\pi, \mathcal{M}^*) \right| \le \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_\pi)^{-1} e_\pi,$$

where $P^*_\pi$ and $e_\pi$ are the transition probability of the true MDP and
the error function when the actions are taken according to policy $\pi$.
(Proof in Appendix A.2.)
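
As an illustration of Lemma 2, the following minimal Python sketch evaluates a fixed policy on a small two-state chain and compares the gap in return between a true model and a simulator with the bound of the lemma. The transition matrices, rewards, and the helper name `discounted_sum` are assumptions made for this example only, and the infinite-horizon quantity $p_0^\top (I - \gamma P)^{-1} c$ is approximated by fixed-point iteration rather than an exact linear solve.

```python
def discounted_sum(P, c, p0, gamma, iters=2000):
    """Approximate p0^T (I - gamma P)^{-1} c by iterating V <- c + gamma P V."""
    n = len(c)
    V = [0.0] * n
    for _ in range(iters):
        V = [c[x] + gamma * sum(P[x][y] * V[y] for y in range(n)) for x in range(n)]
    return sum(p0[x] * V[x] for x in range(n))


# Hypothetical two-state chain induced by some fixed policy pi.
gamma, r_max = 0.9, 1.0
p0 = [0.5, 0.5]
r_pi = [1.0, -1.0]                    # rewards under pi, within [-r_max, r_max]
P_true = [[0.9, 0.1], [0.2, 0.8]]     # P*_pi (unknown in practice)
P_sim = [[0.8, 0.2], [0.3, 0.7]]      # simulator transition matrix under pi

# Error function under pi: here the exact L1 distance per state (Assumption 1 holds).
e_pi = [sum(abs(a - b) for a, b in zip(row_t, row_s))
        for row_t, row_s in zip(P_true, P_sim)]

rho_true = discounted_sum(P_true, r_pi, p0, gamma)
rho_sim = discounted_sum(P_sim, r_pi, p0, gamma)
gap = abs(rho_sim - rho_true)
bound = gamma * r_max / (1.0 - gamma) * discounted_sum(P_true, e_pi, p0, gamma)

print(f"gap = {gap:.4f}, bound = {bound:.4f}")
```

In this toy instance the bound is far from the realized gap, which is in line with the conservatism of worst-case bounds discussed in the following sections.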


As discussed in Section 1, we assume that we are provided with a baseline policy
$\pi_B$ and that we have a very good approximation of its performance
$\rho(\pi_B, \mathcal{M}^*)$. We call a policy $\pi$ safe if it is guaranteed to
perform no worse than the baseline policy in the true MDP $\mathcal{M}^*$, i.e.,
$\rho(\pi, \mathcal{M}^*) \ge \rho(\pi_B, \mathcal{M}^*)$. In the next section,
we explore several methods to find a safe policy $\pi$, given the simulator
$\widehat{\mathcal{M}}$, the error function $e$, the baseline policy $\pi_B$, and
the baseline performance $\rho(\pi_B, \mathcal{M}^*)$. For each method, we
provide a bound on the performance loss of its returned policy $\pi$ w.r.t. an
optimal policy $\pi^*_{\mathcal{M}^*}$ of the true MDP, i.e.,
$\Phi(\pi) = \rho(\pi^*_{\mathcal{M}^*}, \mathcal{M}^*) - \rho(\pi, \mathcal{M}^*)$.


3 Computing Safe Policies 

In this section, we present several different solutions to finding safe policies
in MDP problems, organized from simple to more complex, and discuss their
relative advantages and disadvantages.

Before presenting any solution to this problem, let us look at the naive
approach of solving the simulated MDP $\widehat{\mathcal{M}}$. Let $\pi_S$ be an
optimal policy of $\widehat{\mathcal{M}}$, i.e.,
$\pi_S \in \arg\max_{\pi \in \Pi_S} \rho(\pi, \widehat{\mathcal{M}})$ (or
equivalently $\pi_S = \pi^*_{\widehat{\mathcal{M}}}$). The following theorem
quantifies the performance loss of this policy.

Theorem 3. Let $\pi_S$ be an optimal policy of the simulator
$\widehat{\mathcal{M}}$. Then under Assumption 1, we have

$$\Phi(\pi_S) \le \frac{2\gamma R_{\max}}{(1-\gamma)^2}\, \| e \|_\infty.$$

(Proof in Appendix A.3.)

Unfortunately, there is no guarantee that $\pi_S$ is safe, i.e., that it
performs no worse than the baseline policy $\pi_B$. Thus, deploying $\pi_S$ may
lead to undesirable outcomes due to model uncertainties. In the following
sections, we present methods whose solutions are guaranteed to be safe.

3.1 Solution based on a Reward Adjusted MDP 

In this section, we propose a method that relies on solving the MDP
$\widetilde{\mathcal{M}} = (\mathcal{X}, \mathcal{A}, \tilde{r}, \widehat{P}, p_0, \gamma)$,
which is exactly the same as the simulated MDP $\widehat{\mathcal{M}}$, except
that its reward function is adjusted as

$$\tilde{r}(x,a) = r(x,a) - \frac{\gamma R_{\max}}{1-\gamma}\, e(x,a), \qquad \forall x \in \mathcal{X},\ \forall a \in \mathcal{A}. \tag{1}$$

The key property of this MDP is that, when Assumption 1 holds, the performance
of any policy $\pi$ in $\widetilde{\mathcal{M}}$ is a lower bound on its
performance in the true MDP $\mathcal{M}^*$, i.e.,
$\rho(\pi, \widetilde{\mathcal{M}}) \le \rho(\pi, \mathcal{M}^*)$ (see
Theorem 4). Algorithm 1 summarizes the method for computing a policy using the
reward adjusted MDP (RaMDP) $\widetilde{\mathcal{M}}$. It returns an optimal
policy of $\widetilde{\mathcal{M}}$ when the performance of this policy in
$\widetilde{\mathcal{M}}$ is better than the baseline performance
$\rho(\pi_B, \mathcal{M}^*)$, and returns $\pi_B$ otherwise.


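Algorithm 1: Solution based on the RaMDP

input : Simulated MDP $\widehat{\mathcal{M}}$, baseline performance $\rho(\pi_B, \mathcal{M}^*)$, and error
        function $e$
output: Policy $\pi_{Ra}$

1 $\tilde{r}(x,a) \leftarrow r(x,a) - \frac{\gamma R_{\max}}{1-\gamma}\, e(x,a)$ ;

2 $\pi_0 \leftarrow \arg\max_{\pi \in \Pi_S} \rho(\pi, \widetilde{\mathcal{M}})$ ;

3 $\rho_0 \leftarrow \rho(\pi_0, \widetilde{\mathcal{M}})$ ;

4 if $\rho_0 > \rho(\pi_B, \mathcal{M}^*)$ then $\pi_{Ra} \leftarrow \pi_0$ else $\pi_{Ra} \leftarrow \pi_B$ ; return $\pi_{Ra}$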


Since the performance of any policy in $\widetilde{\mathcal{M}}$ is a lower
bound on its performance in $\mathcal{M}^*$, the policy $\pi_{Ra}$ returned by
Algorithm 1 is guaranteed to perform at least as well as the baseline policy
$\pi_B$. Theorem 4 shows that $\pi_{Ra}$ is a safe policy and quantifies its
performance loss.
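
The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `value_iteration` and `ramdp_policy`, the list-based model layout (`P_hat[x][a][y]`, `r[x][a]`, `e[x][a]`), the restriction to deterministic policies, and the two-state example are all hypothetical choices made for this sketch.

```python
def value_iteration(P, r, gamma, iters=2000):
    """Solve a finite MDP with P[x][a][y] and r[x][a]; return (V, greedy policy)."""
    n, m = len(r), len(r[0])

    def q(x, a, V):
        return r[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(n))

    V = [0.0] * n
    for _ in range(iters):
        V = [max(q(x, a, V) for a in range(m)) for x in range(n)]
    policy = [max(range(m), key=lambda a, x=x: q(x, a, V)) for x in range(n)]
    return V, policy


def ramdp_policy(P_hat, r, e, p0, gamma, r_max, pi_b, rho_b):
    """Algorithm 1 sketch: penalize rewards by the model error, solve, then test."""
    n, m = len(r), len(r[0])
    penalty = gamma * r_max / (1.0 - gamma)
    # Line 1: reward adjustment from Eq. (1).
    r_adj = [[r[x][a] - penalty * e[x][a] for a in range(m)] for x in range(n)]
    # Lines 2-3: optimal policy of the adjusted MDP and its return.
    V, pi_0 = value_iteration(P_hat, r_adj, gamma)
    rho_0 = sum(p0[x] * V[x] for x in range(n))
    # Line 4: accept pi_0 only if its lower bound beats the baseline performance.
    return (pi_0 if rho_0 > rho_b else list(pi_b)), rho_0


# Hypothetical example: action a moves deterministically to state a.
P_hat = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 0.5], [0.0, 1.0]]
p0, pi_b, rho_b = [1.0, 0.0], [0, 0], 0.0

no_error = [[0.0, 0.0], [0.0, 0.0]]
large_error = [[0.5, 0.5], [0.5, 0.5]]
pi_accept, rho_accept = ramdp_policy(P_hat, r, no_error, p0, 0.9, 1.0, pi_b, rho_b)
pi_reject, rho_reject = ramdp_policy(P_hat, r, large_error, p0, 0.9, 1.0, pi_b, rho_b)
print(pi_accept, pi_reject)
```

With no model error the optimized policy passes the test, while with a large error the penalty drives the adjusted return below the baseline performance and the baseline policy is returned.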

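Theorem 4. Given Assumption 1, the solution $\pi_{Ra}$ of Algorithm 1 is safe,
i.e., $\rho(\pi_{Ra}, \mathcal{M}^*) \ge \rho(\pi_B, \mathcal{M}^*)$. Moreover,
its performance loss $\Phi(\pi_{Ra})$ satisfies

$$\Phi(\pi_{Ra}) \le \min\left\{ \frac{2\gamma R_{\max}}{(1-\gamma)^2}\, \| e_{\pi^*_{\mathcal{M}^*}} \|_{1, u^*_{\mathcal{M}^*}},\ \Phi(\pi_B) \right\},$$

where $u^*_{\mathcal{M}^*}$ is the normalized state occupancy frequency of the
optimal policy $\pi^*_{\mathcal{M}^*}$. (Proof in Appendix A.4.)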

Note that Theorem 4 indicates that by this simple adjustment of the reward
function of the simulated MDP $\widehat{\mathcal{M}}$, we can guarantee that our
solution is safe. Moreover, it shows that the bound on the performance loss of
$\pi_{Ra}$ is tighter than that for the solution $\pi_S$ of the simulator
$\widehat{\mathcal{M}}$ in Theorem 3. In particular, the $L_\infty$-norm has been
replaced by a weighted $L_1$-norm. In terms of computational complexity, since
Algorithm 1 only requires solving a standard MDP, it can be implemented by
value iteration, policy iteration, or linear programming. The corresponding
complexity is therefore $O(|\mathcal{A}| |\mathcal{X}|^2 / (1-\gamma))$ (for
value iteration) [14], which is low-order polynomial in $|\mathcal{A}|$,
$|\mathcal{X}|$, and $1/(1-\gamma)$.

While Algorithm 1 provides good theoretical guarantees and has low
computational cost, it may be overly conservative in many circumstances. This is
because the adjustment of the reward function is based on the assumption that
there exists a state with the optimal value of $R_{\max}/(1-\gamma)$ and that
this state is accessible from every other state with the reward $R_{\max}$.
Since this assumption is rarely true, we propose a more adaptive formulation in
the following section via RMDP methods.
3.2 Solution based on the Robust MDP

Robust optimization is a standard technique to deal with model uncertainty. In this 
section, we propose an algorithm for finding a safe policy via solving the robust MDP 
(RMDP). We prove that the policy returned by this algorithm is safe and has better 
(sharper) worst-case guarantees. 


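Algorithm 2: Solution based on the RMDP

input : Simulated MDP $\widehat{\mathcal{M}}$, baseline performance $\rho(\pi_B, \mathcal{M}^*)$, and the error
        function $e$
output: Policy $\pi_R$

// Construct the uncertainty set

1 $\mathcal{U}(\widehat{P}, e) = \left\{ P : \| P(\cdot \mid x,a) - \widehat{P}(\cdot \mid x,a) \|_1 \le e(x,a),\ \forall (x,a) \in \mathcal{X} \times \mathcal{A} \right\}$ ;

2 $\pi_0 \leftarrow \arg\max_{\pi \in \Pi_S} \min_{P \in \mathcal{U}(\widehat{P}, e)} \rho(\pi, \mathcal{M}(P))$ ;

3 $\rho_0 \leftarrow \min_{P \in \mathcal{U}(\widehat{P}, e)} \rho(\pi_0, \mathcal{M}(P))$ ;

4 if $\rho_0 > \rho(\pi_B, \mathcal{M}^*)$ then $\pi_R \leftarrow \pi_0$ else $\pi_R \leftarrow \pi_B$ ; return $\pi_R$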


The robust method is summarized in Algorithm 2. It first constructs an
uncertainty set $\mathcal{U}(\widehat{P}, e)$ using the simulator
$\widehat{\mathcal{M}}$ and the error function $e$. It then solves the resultant
RMDP ($\mathcal{M}(P)$ with $P \in \mathcal{U}(\widehat{P}, e)$) and returns its
solution if its worst-case performance over the uncertainty set is better than
the baseline performance $\rho(\pi_B, \mathcal{M}^*)$, and returns $\pi_B$
otherwise.

Algorithm 2 involves solving an ($s,a$-rectangular) RMDP. RMDPs satisfy many of
the same properties as regular MDPs, such as the existence of an optimal
stationary policy. RMDPs can often be solved quite efficiently by value
iteration, policy iteration, or modified policy iteration [9, 12, 16]. In
practical terms, solving an RMDP can be expensive because computing the
worst-case transition probability requires solving a convex optimization problem
for each state and action in every iteration. However, the robust solution can
be computed very efficiently when the uncertainty set is described, as in our
formulation, in terms of an $L_1$-norm; similar results also exist for the
$L_2$-norm [9]. Since implementing Algorithm 2 involves solving RMDPs, the
complexity is $O(|\mathcal{A}| |\mathcal{X}|^3 \log(|\mathcal{X}|)/(1-\gamma))$
(for robust value iteration). This is higher than that of Algorithm 1 but is
still polynomial in $|\mathcal{A}|$, $|\mathcal{X}|$, and $1/(1-\gamma)$. The
following theorem shows that the policy $\pi_R$ is safe and quantifies its
performance loss.
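
To make the inner worst-case step concrete, the sketch below uses a well-known greedy procedure for $L_1$ uncertainty sets: move up to $e/2$ probability mass onto the lowest-value next state and remove the same mass from the highest-value states. This is one standard technique and is not claimed to be the implementation used in this paper; the function names `worst_case_l1` and `robust_value_iteration`, the data layout, and the example model are assumptions made for illustration.

```python
def worst_case_l1(p_hat, v, eps):
    """Minimize p.v over distributions p with ||p - p_hat||_1 <= eps (greedy)."""
    n = len(p_hat)
    p = list(p_hat)
    i_min = min(range(n), key=lambda i: v[i])
    # Mass moved onto the worst state; the L1 distance then equals 2 * remaining.
    remaining = min(eps / 2.0, 1.0 - p[i_min])
    p[i_min] += remaining
    # Remove the same mass from the best states first.
    for i in sorted(range(n), key=lambda i: v[i], reverse=True):
        if remaining <= 0.0:
            break
        if i == i_min:
            continue
        take = min(remaining, p[i])
        p[i] -= take
        remaining -= take
    return p


def robust_value_iteration(P_hat, r, e, gamma, iters=500):
    """Robust value iteration for an (s,a)-rectangular L1 uncertainty set."""
    n, m = len(r), len(r[0])

    def q(x, a, V):
        p = worst_case_l1(P_hat[x][a], V, e[x][a])
        return r[x][a] + gamma * sum(p[y] * V[y] for y in range(n))

    V = [0.0] * n
    for _ in range(iters):
        V = [max(q(x, a, V) for a in range(m)) for x in range(n)]
    policy = [max(range(m), key=lambda a, x=x: q(x, a, V)) for x in range(n)]
    return V, policy


# Hypothetical example: action a moves deterministically to state a.
P_hat = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 0.5], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
err = [[0.2, 0.2], [0.2, 0.2]]

V_nominal, _ = robust_value_iteration(P_hat, r, zero, 0.9)
V_robust, pi_robust = robust_value_iteration(P_hat, r, err, 0.9)
p_worst = worst_case_l1([0.5, 0.3, 0.2], [1.0, 2.0, 3.0], 0.4)
print(V_nominal, V_robust, p_worst)
```

Given an initial distribution $p_0$, the quantity $\rho_0$ on Line 3 of Algorithm 2 corresponds to $p_0^\top V$ for the robust value function, which is then compared with the baseline performance.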

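Theorem 5. Given Assumption 1, the nonempty solution $\pi_R$ of Algorithm 2 is
safe, i.e., $\rho(\pi_R, \mathcal{M}^*) \ge \rho(\pi_B, \mathcal{M}^*)$.
Moreover, its performance loss satisfies

$$\Phi(\pi_R) \le \min\left\{ \frac{2\gamma R_{\max}}{(1-\gamma)^2}\, \| e_{\pi^*_{\mathcal{M}^*}} \|_{1, u^*_{\mathcal{M}^*}},\ \Phi(\pi_B) \right\},$$

where $u^*_{\mathcal{M}^*}$ is the normalized state occupancy frequency of the
optimal policy $\pi^*_{\mathcal{M}^*}$. (Proof in Appendix A.5.)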

Compared to $\pi_S$ and the bound in Theorem 3 on its performance loss,
Theorem 5 indicates that the policy $\pi_R$ returned by Algorithm 2 is safe and
has a smaller bound on its performance loss. In particular, the bound depends
only on a weighted $L_1$-norm of the errors for the optimal policy, instead of
the norm over all policies.

While the complexity of Algorithm 2 is higher than that of Algorithm 1 (because
solving RMDPs is more complex than solving standard MDPs), Theorem 5 does not
show any advantage for $\pi_R$ over the policy $\pi_{Ra}$ returned by
Algorithm 1, neither in terms of safety nor in terms of the bound on its
performance loss (Theorem 4). This raises the question of why one should use the
more complex Algorithm 2 in place of Algorithm 1. Proposition 6 provides an
answer to this question and shows that whenever Algorithm 1 returns a safe
policy, so does Algorithm 2, while the converse is not necessarily true. This
implies that, with extra computational complexity, the conservatism of safe
policy search decreases.


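Proposition 6. Given Assumption 1, for each policy $\pi$, we have

$$\min_{P \in \mathcal{U}(\widehat{P}, e)} \rho(\pi, \mathcal{M}(P)) \ge \rho(\pi, \widetilde{\mathcal{M}}),$$

where $\widetilde{\mathcal{M}}$ is defined in Section 3.1. (Proof in
Appendix A.6.)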

Note that the bound for $\pi_R$ is based on the worst-case transition
probabilities in $\mathcal{U}(\widehat{P}, e)$. While this approach is
guaranteed to yield a safe policy, it may still be too conservative, since
solving an RMDP amounts to finding a safe policy under the worst-case scenario.
Next, we investigate an alternative approach that returns a safe but less
conservative policy.


3.3 Solution based on an Augmented Robust MDP 

As discussed at the end of Section 3.2, the goal in this section is to develop a
new method that combines simulated MDPs and RMDPs and reduces the conservatism
of safe policy search compared to Algorithm 2. We start by considering the
following constrained optimization problem, which finds a policy that maximizes
the performance in the simulator and satisfies the safety constraint:

$$\max_{\pi \in \Pi_H} \rho(\pi, \widehat{\mathcal{M}}), \quad \text{subject to } \rho(\pi, \mathcal{M}^*) \ge \rho(\pi_B, \mathcal{M}^*), \tag{2}$$

where $\Pi_H$ is the general set of history-based policies. To solve (2), we
employ the Lagrangian relaxation procedure to convert it to the following
unconstrained optimization problem:

$$\max_{\pi \in \Pi_H} \min_{\lambda \ge 0} L(\pi, \lambda) := \rho(\pi, \widehat{\mathcal{M}}) + \lambda \big( \rho(\pi, \mathcal{M}^*) - \rho(\pi_B, \mathcal{M}^*) \big), \tag{3}$$

where $\lambda$ is the Lagrange multiplier. Unfortunately, solving the
optimization problem (3) is impossible, since the true MDP $\mathcal{M}^*$ is
unknown. Before describing how we tackle this issue, let us define a few terms
and quantities. For the simulated MDP $\widehat{\mathcal{M}}$ and any MDP
$\mathcal{M}(P)$ that differs from $\widehat{\mathcal{M}}$ only in its transition
probability function, and for any $\lambda_1, \lambda_2 \ge 0$, we define the
augmented MDP
$\mathcal{M}^A_{\lambda_1,\lambda_2}(P) = (\mathcal{X} \times \mathcal{X}, \mathcal{A}, r^A, P^A, p^A_0, \gamma)$,
where $r^A(x,y,a) = \lambda_1 r(x,a) + \lambda_2 r(y,a)$,
$P^A(x',y' \mid x,y,a) = P(x' \mid x,a)\, \widehat{P}(y' \mid y,a)$, and
$p^A_0(x,y) = p_0(x)\, p_0(y)$. Here we can think of the $x$'s and $y$'s as the
states evolving according to the MDPs $\mathcal{M}(P)$ and
$\widehat{\mathcal{M}}$, respectively. We denote by $\Pi^A_S$ the set of
stationary Markovian policies over this augmented MDP. Using the augmented MDP,
for any policy $\pi \in \Pi_H$, we define the Lagrangian function

$$L_P(\pi, \lambda) = \rho\big(\pi, \mathcal{M}^A_{\lambda,1}(P)\big) - \lambda\, \rho(\pi_B, \mathcal{M}^*). \tag{4}$$

Using the above definitions, the Lagrangian function in (3) can be written as
$L(\pi, \lambda) = L_{P^*}(\pi, \lambda)$. From Theorem 3.6 in [2], it can be
shown that for any transition probability function
$P \in \mathcal{U}(\widehat{P}, e)$, the following strong duality holds:

$$\max_{\pi \in \Pi_H} \min_{\lambda \ge 0} L_P(\pi, \lambda) = \max_{\pi \in \Pi^A_S} \min_{\lambda \ge 0} L_P(\pi, \lambda) = \min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} L_P(\pi, \lambda). \tag{5}$$

Setting $P = P^*$, (5) implies that we can replace the class of history-based
policies $\Pi_H$ with the set of stationary Markovian policies over the
augmented MDP, $\Pi^A_S$, in problem (3). This further means that $\Pi^A_S$ is
the class of dominating policies for the optimization problem (3).

Recall that duality theory indicates that if the dual Lagrangian problem is
bounded, then the primal Lagrangian problem is feasible. This means that if we
can find a lower bound for the dual Lagrangian problem
$\min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} L(\pi, \lambda)$ (note that
$L(\pi, \lambda) = L_{P^*}(\pi, \lambda)$ and
$P^* \in \mathcal{U}(\widehat{P}, e)$), the corresponding policy will be
feasible for the constraint in (2), which itself means that it is safe. This
motivates us to find a saddle point of the (augmented) robust optimization
problem

$$\min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} \min_{P \in \mathcal{U}(\widehat{P}, e)} L_P(\pi, \lambda). \tag{6}$$

Note that, compared to the Lagrangian function $L(\pi, \lambda)$, in (6) we have
replaced the true transition probability $P^*$ with the worst-case transition
probability over the uncertainty set $\mathcal{U}(\widehat{P}, e)$. The reason
for finding a saddle point of (6) is that its solution is a lower bound for the
dual Lagrangian:

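$$\min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} \min_{P \in \mathcal{U}(\widehat{P}, e)} L_P(\pi, \lambda) \overset{(a)}{=} \min_{P \in \mathcal{U}(\widehat{P}, e)} \min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} L_P(\pi, \lambda) \overset{(b)}{\le} \min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} L(\pi, \lambda).$$

(a) Theorem 1 in [16] shows that strong duality holds in
$(\mathcal{X} \times \mathcal{A})$-rectangular robust optimization problems,
i.e.,

$$\min_{P \in \mathcal{U}(\widehat{P}, e)} \max_{\pi \in \Pi^A_S} L_P(\pi, \lambda) = \max_{\pi \in \Pi^A_S} \min_{P \in \mathcal{U}(\widehat{P}, e)} L_P(\pi, \lambda).$$

(b) This follows from the fact that $L(\pi, \lambda) = L_{P^*}(\pi, \lambda)$ and
$P^* \in \mathcal{U}(\widehat{P}, e)$, so the minimum over the uncertainty set
cannot exceed the value at $P^*$.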

Therefore, if we find a saddle point $(\pi_0, \lambda^*)$ of the (augmented)
robust optimization problem (6), then the corresponding policy $\pi_0$ is safe.
Given the above observations, we now present Algorithm 3 and prove in Theorem 7
that the policy returned by this algorithm is safe, and we quantify its
performance loss. On Line 2 of Algorithm 3, we use the conventional subgradient
descent approach to solve for a saddle point. In this approach, we first fix the
Lagrange multiplier and solve for an optimal stationary policy, and then
optimize for the Lagrange multiplier (which is a convex optimization problem).
These two steps are repeated until the solution converges to a saddle point.
More details of this procedure can be found in Appendix A.7. Regarding the
computational cost of this approach, since each subgradient descent step
involves solving an RMDP, similar to Algorithm 2, it has complexity
$O(|\mathcal{A}| |\mathcal{X}|^3 \log(|\mathcal{X}|)/(1-\gamma))$ for robust
value iteration. Thus the total complexity of Algorithm 3 is
$O(|\mathcal{A}| |\mathcal{X}|^3 \log(|\mathcal{X}|)/(1-\gamma)(1/\sqrt{K}))$,
where $K$ is the number of iterations of subgradient descent and
$O(1/\sqrt{K})$ is the standard convergence rate for first-order methods in
convex optimization.

Regarding the augmented Markovian policy $\pi_0$ in Algorithm 3, note that
$\pi_0$ requires state information from both the uncertain MDP $\mathcal{M}(P)$
(with state $X_t$) and the simulated MDP $\widehat{\mathcal{M}}$ (with state
$Y_t$). Therefore, implementing this policy in the real world (i.e., with
$P = P^*$) requires real-time state trajectories from the online simulator as
well. This inevitably increases the complexity of implementation.


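Algorithm 3: Solution based on the Augmented RMDP

input : Simulated MDP $\widehat{\mathcal{M}}$, baseline performance $\rho(\pi_B, \mathcal{M}^*)$, and the error
        function $e$
output: Policy $\pi_{AR}$

1 Construct the uncertainty set $\mathcal{U}(\widehat{P}, e)$ and the augmented MDP
  $\mathcal{M}^A_{\lambda,1}(P)$, $\forall P \in \mathcal{U}(\widehat{P}, e)$ ;

2 Solve $\min_{\lambda \ge 0} \max_{\pi \in \Pi^A_S} \min_{P \in \mathcal{U}(\widehat{P}, e)} L_P(\pi, \lambda)$ for a saddle point $(\pi_0, \lambda^*)$ ;

3 if a saddle-point solution $(\pi_0, \lambda^*)$ exists then $\pi_{AR} \leftarrow \pi_0$ else $\pi_{AR} \leftarrow \pi_B$ ; return $\pi_{AR}$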


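Theorem 7. Given Assumption 1, the nonempty solution $\pi_{AR}$ of Algorithm 3
is safe, i.e., $\rho(\pi_{AR}, \mathcal{M}^*) \ge \rho(\pi_B, \mathcal{M}^*)$.
Moreover, its performance loss $\Phi(\pi_{AR})$ satisfies

$$\Phi(\pi_{AR}) \le \min\left\{ \frac{2\gamma R_{\max}}{(1-\gamma)^2}\, \| e_{\pi^*_{\mathcal{M}^*}} \|_{1, u^*_{\mathcal{M}^*}},\ \Phi(\pi_B) \right\},$$

where $u^*_{\mathcal{M}^*}$ is the normalized state occupancy frequency of the
optimal policy $\pi^*_{\mathcal{M}^*}$. (Proof in Appendix A.8.)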


Similar to Section 3.2, compared to $\pi_S$ and the bound in Theorem 3 on its
performance loss, Theorem 7 indicates that the policy $\pi_{AR}$ returned by
Algorithm 3 is safe and has a tighter bound on its performance loss. However,
while Algorithm 3 has higher computational complexity than Algorithms 1 and 2,
Theorem 7 does not show any advantage for its returned policy $\pi_{AR}$ over
those returned by Algorithms 1 and 2, $\pi_{Ra}$ and $\pi_R$. This raises a
question similar to that in Section 3.2: why should we use Algorithm 3 instead
of Algorithm 2? Proposition 8 provides an answer to this question and shows that
whenever Algorithm 2 returns a safe policy, so does Algorithm 3, while the
converse is not always true. This again reflects the fact that with extra
computational complexity, we reduce the conservatism of safe policy search.
Proposition 8. Given Assumption 1, if
$\rho(\pi_B, \mathcal{M}^*) < \rho(\pi_R, \mathcal{M}(P))$, then
$L_P(\pi_{AR}, \lambda^*)$ is lower-bounded. This means that if Algorithm 2
returns a safe policy other than $\pi_B$, so does Algorithm 3. (Proof in
Appendix A.9.)


3.4 Combining Robust and Baseline Policies 

While the robust solution described in Algorithm 2 (or Algorithm 3) is less
conservative than Algorithm 1, it may nevertheless be overly restrictive. This
is because the proposed improved policy $\pi_0$ is evaluated for the worst-case
realization of the transition probability in $\mathcal{U}(\widehat{P}, e)$,
while the return of the baseline policy is with respect to $P^*$. As a result, a
candidate policy $\pi_0$ can be rejected even if, for every realization
$P \in \mathcal{U}(\widehat{P}, e)$, the policy is better than the baseline. The
left example in Figure 2 depicts a case in which the robust solution is too
restrictive.



Figure 2: Left: Example in which the policy returned by Algorithm 2 is too
restrictive. Right: Example in which the policy returned by Algorithm 2
performs better than the baseline policy in some states but not in others.

An additional limitation of Algorithms 1, 2, and 3 is that when the evaluation
criterion of the computed policy fails, the baseline policy is not improved at
all. This restriction is illustrated by the following counterexample:

Consider a simple case in which the MDP $\mathcal{M}$ is composed of two
separate MDPs $\mathcal{M}_1$ and $\mathcal{M}_2$, and suppose that the
estimated model of $\mathcal{M}_1$ is very good, while the model of
$\mathcal{M}_2$ is quite imprecise. Also assume that the initial distribution is
uniform over an initial state in $\mathcal{M}_1$ and an initial state in
$\mathcal{M}_2$. Intuitively, the best solution would use the optimized policy
for $\mathcal{M}_1$, which has a precise model, and the baseline policy for
$\mathcal{M}_2$, which has an imprecise model. However, Algorithm 2 simply
returns the baseline policy for both components because the return in
$\mathcal{M}_2$ potentially reduces the quality of the robust solution. This
phenomenon is illustrated in the right example of Figure 2, for which it would
be beneficial to combine the baseline policy with the optimized one instead of
returning either one.

The above limitations can be addressed by modifying the robust optimization
problem. Intuitively, the objective is to find a policy that maximizes the
improvement over the baseline for any plausible transition probabilities.
Equivalently, this reduces to a distributionally robust optimization (DRO)
problem:

$$\pi_I \in \arg\max_{\pi} \min_{P \in \mathcal{U}(\widehat{P}, e)} \big( \rho(\pi, P) - \rho(\pi_B, P) \big). \tag{7}$$

Compared with Algorithms 1, 2, and 3, the first appealing fact about this
approach is that the solution policy is always safe because $\pi_B$ is feasible
in (7). The second appealing fact is that it only requires knowledge of the
baseline policy $\pi_B$, without the baseline performance $\rho(\pi_B, P^*)$.
Nevertheless, (7) is a DRO problem, which in general can be NP-hard to solve.
The computational methods used to solve this problem are beyond the scope of
this paper; we approximate the solution using a heuristic iterative algorithm
based on value iteration. The additional complexity of this formulation is again
consistent with the observation that additional computational complexity reduces
conservatism in safe policy search. Similar to the other three formulations, the
following theorem states the safety of the computed policy and describes its
performance loss.

Figure 3: Improvement in return over the baseline policy for the proposed methods.


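Theorem 9. Given that Assumption 1 is satisfied, a solution $\pi_I$ to (7) is
safe, i.e., $\rho(\pi_I, P^*) \ge \rho(\pi_B, P^*)$. Moreover, the performance
loss of $\pi_I$ satisfies

$$\Phi(\pi_I) \le \min\left\{ \frac{2\gamma R_{\max}}{(1-\gamma)^2} \left( \| e_{\pi^*_{\mathcal{M}^*}} \|_{1, u^*_{\mathcal{M}^*}} + \| e_{\pi_B} \|_{1, u_{B,\mathcal{M}^*}} \right),\ \Phi(\pi_B) \right\},$$

where $u^*_{\mathcal{M}^*}$ and $u_{B,\mathcal{M}^*}$ are the normalized state
occupancy frequencies of the optimal and baseline policies
$\pi^*_{\mathcal{M}^*}$ and $\pi_B$, respectively. Also, this bound is tight.
(Proof in Appendix A.10.)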


4 Numerical Comparison 

In this section, we numerically evaluate the proposed methods on a synthetic
benchmark MDP. The benchmark problem loosely models customer interactions with
an online system. The four available actions influence user behavior along two
dimensions. Rewards, which represent user satisfaction, vary only along the
first dimension. The second dimension influences only the transition
probabilities. To simulate a realistic source of a baseline policy, we construct
it to be optimal when the second dimension of the MDP is ignored. The simulator
is constructed directly from the empirical transition probabilities. The
transition error $e$ is based on the sampling bounds in Section A.1 and
decreases in proportion to the inverse square root of the number of samples.
Figure 3 depicts the percentage improvement in total return over the baseline
policy as a function of the overall number of samples used in constructing the
simulator. The methods used in the comparison are as follows. The dashed line
shows the return of the optimal policy. EXP stands for the standard MDP method
with the expected return objective. For a small number of samples, this standard
method does significantly worse than the baseline policy. RWA stands for the
method in Algorithm 1, which leads to a safe policy but, as expected, is overly
conservative. ROB stands for the robust method in Algorithm 2, which also
guarantees the safety of returned policies but is much less conservative than
RWA. Finally, RBC is the algorithm described in Section 3.4. RBC optimizes the
policy in states with many samples and falls back onto the baseline policy
otherwise. The combined policy of RBC is not only safe, but also significantly
improves on the baseline policy even when the number of samples is small.


5 Conclusion 

In this paper, we presented four model-based safe policy search methods and
analyzed their performance. Spanning a range of computational complexity and
conservatism, our approaches provide a full gamut of tools for designing good
policies offline that match baseline performance. To the best of our knowledge,
this line of work is novel in the RL community. Similar approaches in the
model-free setup can be found in [18, 20], where safe policy evaluation takes
place during exploration.

On the technical side, an important future direction is to compare the
performance of policies generated by different safe policy search algorithms and
to explicitly study the solution algorithm in Section 3.4. On the experimental
side, future work includes running advanced simulations in realistic domains
such as battery charging/discharging control for smart grid systems.
A Proofs


A.1 Sampling Bounds

[22] showed that the $L_1$-deviation of the empirical distribution from the true
distribution over $m$ distinct events from $n$ samples is bounded as

$$\mathbb{P}\left\{ \| \widehat{P}(\cdot) - P(\cdot) \|_1 \ge \epsilon \right\} \le (2^m - 2) \exp\left( -\frac{n \epsilon^2}{2} \right). \tag{8}$$

Now consider a fixed state-action pair $(x,a)$ and assume that the transition
probability $\widehat{P}(\cdot \mid x,a)$ has been estimated using $N(x,a)$
visits to $(x,a)$. The random event for the transition probability estimate is
the state to which the system transits. In this case, $m = |\mathcal{X}|$, and
using (8) we may write

$$\| P^*(\cdot \mid x,a) - \widehat{P}(\cdot \mid x,a) \|_1 \le \sqrt{ \frac{2}{N(x,a)} \log \frac{2^{|\mathcal{X}|} - 2}{\delta} },$$

with probability at least $1 - \delta$. Thus, by setting
$e(x,a) = \sqrt{ \frac{2}{N(x,a)} \log \frac{|\mathcal{X}| |\mathcal{A}| (2^{|\mathcal{X}|} - 2)}{\delta} }$,
we can guarantee that $\mathbb{P}\{ P^* \notin \mathcal{U}(\widehat{P}, e) \} \le \delta$.
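
As a rough numerical illustration of this error function, the Python sketch below computes $e(x,a)$ from a visit count. It assumes the confidence level $\delta$ is split over all $|\mathcal{X}||\mathcal{A}|$ state-action pairs by a union bound; the function name `weissman_error`, its parameters, and the cap at 2 (the largest possible $L_1$ distance between two distributions) are choices made for this sketch rather than details taken from the paper.

```python
import math


def weissman_error(n_visits, n_states, n_actions, delta):
    """L1 error bound e(x, a) from N(x, a) visits, union-bounded over all pairs.

    Requires n_states >= 2 so that 2**n_states - 2 is positive.
    """
    if n_visits <= 0:
        return 2.0  # no data: only the trivial bound on the L1 distance holds
    log_term = (math.log(n_states * n_actions)
                + math.log(2 ** n_states - 2)
                - math.log(delta))
    return min(2.0, math.sqrt(2.0 * log_term / n_visits))


# The bound shrinks with the inverse square root of the number of visits.
e_100 = weissman_error(100, n_states=3, n_actions=2, delta=0.05)
e_10000 = weissman_error(10000, n_states=3, n_actions=2, delta=0.05)
print(e_100, e_10000)
```

Increasing the number of visits by a factor of 100 reduces the error by a factor of 10, which matches the square-root behavior used in the numerical comparison of Section 4.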


A.2 Proof of Lemma 2


Since the return of a policy $\pi$ is the product of the initial state
distribution and the value function of the policy, i.e.,
$\rho(\pi, \mathcal{M}) = p_0^\top V^\pi_{\mathcal{M}}$, Lemma 2 is a direct
consequence of the following lemma.


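Lemma 10. Consider two MDPs $\mathcal{M}_1$ and $\mathcal{M}_2$ that are only
different in their transition probability functions $P_1$ and $P_2$, and reward
functions $r_1$ and $r_2$. Let $\pi_1$ be a policy in $\mathcal{M}_1$ and
$\pi_2$ be a policy in $\mathcal{M}_2$. Under the assumption that for any state
$x \in \mathcal{X}$,
$\| P_1^{\pi_1}(\cdot \mid x) - P_2^{\pi_2}(\cdot \mid x) \|_1 \le g_x$, we have

$$(I - \gamma P_1^{\pi_1})^{-1} \left[ r_1^{\pi_1} - r_2^{\pi_2} - \frac{\gamma R_{\max}}{1-\gamma}\, g \right] \le V^{\pi_1}_{\mathcal{M}_1} - V^{\pi_2}_{\mathcal{M}_2} \le (I - \gamma P_1^{\pi_1})^{-1} \left[ r_1^{\pi_1} - r_2^{\pi_2} + \frac{\gamma R_{\max}}{1-\gamma}\, g \right],$$

where $g$ is the vector of $g_x$'s. Moreover, the above inequalities are tight.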


Proof. The difference between the two value functions can be written as follows: 


VXt 1 , - v m 2 = rT 1 + - rp - iPpV^ 

= rp + 7 PPV^ - r? - 7 PpV% 2 + 7 PpVfo - 7^% 

= (rp - r?) + 7 PpiVfo - V% 2 ) + 7 (Pr - P 2 2 )V ^ 2 
= (I - iPi 1 )- 1 K 1 ^ r? + 7(Pr ^ P£ a )V% 3 ] . 

Now using the Holder’s inequality, for any a: £ A”, we have 

|(Pr(-|*)--Pr(-|*)) T V^ a | < UPrOI^-^OWIIlll^Jloo < 9 X ||V^ 2 ||oo < 9: 
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The proof follows by uniformly bounding (Pf 1 — PJ 2 ) Vff., from the above inequality 
and from the monotonicity of (I — 7 Pf 1 ) -1 . □ 
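The two-sided bound of Lemma 10 can be checked numerically on small examples. The sketch below is an illustration under stated assumptions: policies are already folded into the transition matrices and reward vectors, rewards lie in $[-R_{\max}, R_{\max}]$, and $(I - \gamma P)^{-1} c$ is approximated by fixed-point iteration rather than a linear solve. The helper names are hypothetical.

```python
def evaluate(P, c, gamma, iters=2000):
    """Approximate (I - gamma * P)^{-1} c by iterating V <- c + gamma * P V."""
    n = len(c)
    V = [0.0] * n
    for _ in range(iters):
        V = [c[x] + gamma * sum(P[x][y] * V[y] for y in range(n)) for x in range(n)]
    return V


def lemma10_bounds(P1, r1, P2, r2, gamma, r_max):
    """Lower and upper bounds on V1 - V2 from Lemma 10 (policy-induced chains)."""
    n = len(r1)
    # e_x is the L1 distance between the two transition rows at state x
    e = [sum(abs(P1[x][y] - P2[x][y]) for y in range(n)) for x in range(n)]
    slack = [gamma * r_max / (1.0 - gamma) * e[x] for x in range(n)]
    lower = evaluate(P1, [r1[x] - r2[x] - slack[x] for x in range(n)], gamma)
    upper = evaluate(P1, [r1[x] - r2[x] + slack[x] for x in range(n)], gamma)
    return lower, upper
```

Comparing `evaluate(P1, r1, gamma)` minus `evaluate(P2, r2, gamma)` against the returned vectors shows how loose the bound is for a given pair of models.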


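A.3 Proof of Theorem 3

Proof. From Lemma 10 with $\pi_1 = \pi_2 = \pi_S$, $M_1 = M^*$, and $M_2 = \widehat M$, we have

$$\rho(\pi_S, \widehat M) - \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_{\pi_S})^{-1} e_{\pi_S} \le \rho(\pi_S, M^*).$$

Thus, we may write

$$\begin{aligned}
\Phi(\pi_S) &= \rho(\pi^*_{M^*}, M^*) - \rho(\pi_S, M^*) \\
&\le \rho(\pi^*_{M^*}, M^*) - \rho(\pi_S, \widehat M) + \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_{\pi_S})^{-1} e_{\pi_S} \\
&\overset{(a)}{\le} \rho(\pi^*_{M^*}, M^*) - \rho(\pi^*_{M^*}, \widehat M) + \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_{\pi_S})^{-1} e_{\pi_S} \\
&\overset{(b)}{\le} \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_{\pi^*_{M^*}})^{-1} e_{\pi^*_{M^*}} + \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_{\pi_S})^{-1} e_{\pi_S} \\
&\overset{(c)}{\le} \frac{2 \gamma R_{\max}}{(1-\gamma)^2} \max_{\pi \in \Pi_S} \|e_\pi\|_\infty .
\end{aligned}$$

(a) Comes from the optimality of $\pi_S$ in $\widehat M$.

(b) This is the application of Lemma 2 with policy $\pi_1 = \pi_2 = \pi^*_{M^*}$, $M_1 = M^*$, and
$M_2 = \widehat M$.

(c) This is from the fact that for any policy $\pi$, we have $\|p_0^\top (I - \gamma P^*_\pi)^{-1}\|_1 = 1/(1-\gamma)$,
and from the application of Hölder's inequality. □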


A.4 Proof of Theorem [4] 

Proof. To prove the safety of $\pi_{Ra}$ and bound its performance loss, we need to upper
and lower bound the difference between the performance of any policy $\pi$ in the true
MDP $M^*$ and its performance in $\widetilde M$, i.e., $\rho(\pi, M^*) - \rho(\pi, \widetilde M)$. These upper and
lower bounds are obtained by applying Lemma 10 with $\pi_1 = \pi_2 = \pi$, $M_1 = M^*$,
and $M_2 = \widetilde M$ as follows:

$$\rho(\pi, M^*) - \rho(\pi, \widetilde M) \ge p_0^\top (I - \gamma P^*_\pi)^{-1} \Big( r_\pi - \tilde r_\pi - \frac{\gamma R_{\max}}{1-\gamma}\, e_\pi \Big) \ge 0, \qquad (9)$$

where the second inequality in (9) follows from the definition of the adjusted reward
function $\tilde r$, and the fact that $(I - \gamma P^*_\pi)^{-1}$ is monotone and $p_0$ is non-negative. Similarly,
the upper bound is

$$\rho(\pi, M^*) - \rho(\pi, \widetilde M) \le \frac{2 \gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_\pi)^{-1} e_\pi = \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u^*_\pi}, \qquad (10)$$

where $u^*_\pi = (1-\gamma)\, p_0^\top (I - \gamma P^*_\pi)^{-1}$ is the normalized state occupancy frequency
of policy $\pi$ in the true MDP $M^*$.

To prove the safety of the returned policy $\pi_{Ra}$, consider the two cases on Line 4
of Algorithm 1. When the condition is satisfied, we have $\rho(\pi_B, M^*) \le \rho(\pi_0, \widetilde M) \le
\rho(\pi_0, M^*)$, where the second inequality comes from (9), and thus, the policy $\pi_{Ra} =
\pi_0$ is safe. When the condition is violated, then $\pi_{Ra}$ is simply $\pi_B$, which is safe by
definition.

To derive a bound on the performance loss of the returned policy $\pi_{Ra}$, consider also
the two cases on Line 4 of Algorithm 1. When the condition is satisfied, using (9), we
have

$$\Phi(\pi_{Ra}) = \rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, M^*) \le \rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, \widetilde M),$$

and when the condition is violated, we have

$$\Phi(\pi_{Ra}) = \rho(\pi^*_{M^*}, M^*) - \rho(\pi_B, M^*).$$

Since the condition is satisfied on Line 4 of Algorithm 1 when $\rho(\pi_0, \widetilde M) \ge \rho(\pi_B, M^*)$,
we may write

$$\Phi(\pi_{Ra}) \le \min\big\{ \rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, \widetilde M),\; \rho(\pi^*_{M^*}, M^*) - \rho(\pi_B, M^*) \big\}.$$

Note that we may write the following inequalities for the first term in the minimum

$$\rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, \widetilde M) \overset{(a)}{\le} \rho(\pi^*_{M^*}, M^*) - \rho(\pi^*_{M^*}, \widetilde M) \overset{(b)}{\le} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_{\pi^*_{M^*}}\|_{1, u^*_{\pi^*_{M^*}}},$$

where (a) follows from $\pi_0$ being an optimal policy of MDP $\widetilde M$, (b) is from (10) with
$\pi = \pi^*_{M^*}$, and $u^*_{\pi^*_{M^*}}$ is the normalized state occupancy frequency of the optimal policy
$\pi^*_{M^*}$. This proves the theorem. □
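The right-hand side of (10) is a weighted $L_1$ norm of the error vector under the normalized occupancy frequency. The sketch below shows how that quantity could be evaluated for a given policy-induced transition matrix. It is only an illustration: the bound in the proof uses the true transitions $P^*_\pi$, which are unknown in practice, so the matrix `P` passed here is an assumption of the example, as are the function names.

```python
def occupancy(P, p0, gamma, iters=2000):
    """Normalized occupancy u = (1 - gamma) * p0^T (I - gamma * P)^{-1},
    accumulated as (1 - gamma) * sum_t gamma^t * p0^T P^t."""
    n = len(p0)
    d = list(p0)           # state distribution at time t
    u = [0.0] * n
    weight = 1.0 - gamma   # equals (1 - gamma) * gamma^t at step t
    for _ in range(iters):
        for x in range(n):
            u[x] += weight * d[x]
        d = [sum(d[x] * P[x][y] for x in range(n)) for y in range(n)]
        weight *= gamma
    return u


def loss_bound(P, p0, e, gamma, r_max):
    """2 * gamma * R_max / (1 - gamma)^2 * ||e||_{1,u} for the chain P."""
    u = occupancy(P, p0, gamma)
    weighted_l1 = sum(u[x] * abs(e[x]) for x in range(len(e)))
    return 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * weighted_l1
```

Because $u$ sums to one, a constant error vector $e_x = c$ gives the bound $2\gamma R_{\max} c / (1-\gamma)^2$ regardless of the dynamics; state-dependent errors are weighted by how often the policy visits each state.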


A.5 Proof of Theorem 5

Proof. To prove the safety of $\pi_R$ and bound its performance loss, we need to upper and
lower bound the difference between the performance of any policy $\pi$ in the true MDP
$M^*$ and its worst-case performance, $\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P))$. Since $P^* \in \mathcal U(\widehat P, e)$
from Assumption 1, we have

$$\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) \le \rho(\pi, M^*). \qquad (11)$$

Now let $\bar P$ be the minimizer in $\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P))$. The minimizer exists because
$\rho(\pi, M(P))$ is continuous in $P$ and the uncertainty set is compact. From Assumption 1
and the construction of $\mathcal U(\widehat P, e)$, for any $(x, a) \in \mathcal X \times \mathcal A$, we have

$$\begin{aligned}
\|\bar P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 &= \|\bar P(\cdot|x,a) - \widehat P(\cdot|x,a) + \widehat P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 \\
&\le \|\bar P(\cdot|x,a) - \widehat P(\cdot|x,a)\|_1 + \|\widehat P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 \\
&\le 2 e(x, a).
\end{aligned}$$

Applying Lemma 10 with $\pi_1 = \pi_2 = \pi$, $M_1 = M^*$, and $M_2 = M(\bar P)$, we obtain

$$\rho(\pi, M^*) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) \le \frac{2 \gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_\pi)^{-1} e_\pi = \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u^*_\pi}, \qquad (12)$$

where $u^*_\pi = (1-\gamma)\, p_0^\top (I - \gamma P^*_\pi)^{-1}$ is the normalized state occupancy frequency of
policy $\pi$ in the true MDP $M^*$.

To prove the safety of the returned policy $\pi_R$, consider the two cases on Line 4 of
Algorithm 2. When the condition is satisfied, we have $\rho(\pi_B, M^*) \le \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)) \le
\rho(\pi_0, M^*)$, where the second inequality comes from (11), and thus, the policy $\pi_R = \pi_0$
is safe. When the condition is violated, then $\pi_R$ is simply $\pi_B$, which is safe by definition.

To derive a bound on the performance loss of the returned policy $\pi_R$, consider also
the two cases on Line 4 of Algorithm 2. When the condition is satisfied, using (11), we
have

$$\Phi(\pi_R) = \rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, M^*) \le \rho(\pi^*_{M^*}, M^*) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)),$$

and when the condition is violated, we have

$$\Phi(\pi_R) = \rho(\pi^*_{M^*}, M^*) - \rho(\pi_B, M^*).$$

Since the condition is satisfied on Line 4 of Algorithm 2 when $\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)) \ge
\rho(\pi_B, M^*)$, we may write

$$\Phi(\pi_R) \le \min\Big\{ \rho(\pi^*_{M^*}, M^*) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)),\; \rho(\pi^*_{M^*}, M^*) - \rho(\pi_B, M^*) \Big\}.$$

Note that we may write the following inequalities for the first term in the minimum

$$\rho(\pi^*_{M^*}, M^*) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)) \overset{(a)}{\le} \rho(\pi^*_{M^*}, M^*) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi^*_{M^*}, M(P)) \overset{(b)}{\le} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_{\pi^*_{M^*}}\|_{1, u^*_{\pi^*_{M^*}}},$$

where (a) follows from $\pi_0$ being the maximizer in solving the robust MDP, (b) is
from (12) with $\pi = \pi^*_{M^*}$, and $u^*_{\pi^*_{M^*}}$ is the normalized state occupancy frequency of
the optimal policy $\pi^*_{M^*}$. □

A.6 Proof of Proposition [6] 


Proof. Let $\bar P$ be the minimizer of $\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P))$. Then from Lemma 10,
for each $\pi$:

$$\rho(\pi, M(\bar P)) \ge \rho(\pi, \widehat M) - \frac{\gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma \widehat P_\pi)^{-1} e_\pi = \rho(\pi, \widetilde M).$$

The last equality holds because $\widetilde M$ differs from $\widehat M$ only in its reward function:

$$\tilde r_\pi = r_\pi - \frac{\gamma R_{\max}}{1-\gamma}\, e_\pi. \qquad \square$$

A.7 The Saddle Point Solution Algorithm

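Now we turn to the solution algorithm. Notice that for every fixed $\lambda \ge 0$, the
inner optimization problem in (6) is a robust MDP problem with $(s, a)$-rectangular
uncertainty. We now summarize the standard results of robust value iteration from
MM-. For any $(x, y) \in \mathcal X \times \mathcal X$, define the $\lambda$-parametrized robust Bellman operator
$T_\lambda : \mathbb R^{|\mathcal X \times \mathcal X|} \to \mathbb R^{|\mathcal X \times \mathcal X|}$ as follows:

$$T_\lambda[V](x, y) = \max_{a \in \mathcal A} \Big\{ r_\lambda(x, y, a) + \gamma \min_{P \in \mathcal U(\widehat P, e)} \sum_{(x', y') \in \mathcal X \times \mathcal X} P(x', y' | x, y, a)\, V(x', y') \Big\}.$$

Since the robust Bellman operator is a contraction mapping, its unique fixed point
solution equals the optimal robust Lagrangian function, i.e., $V_\lambda(x, y) = T_\lambda[V_\lambda](x, y)$,
$\forall (x, y) \in \mathcal X \times \mathcal X$ and

$$\sum_{x, y \in \mathcal X} p_0(x)\, p_0(y)\, V_\lambda(x, y) = \min_{P \in \mathcal U(\widehat P, e)} \max_{\pi} L(P, \pi, \lambda). \qquad (13)$$

Thus the robust value function $V_\lambda(x, y)$ can be calculated by robust value iteration 111 61,

$$V_{\lambda,0}(x, y) = V_0(x, y), \qquad V_{\lambda,N+1}(x, y) = T_\lambda[V_{\lambda,N}](x, y), \qquad \forall (x, y) \in \mathcal X \times \mathcal X,\; N \in \{0, 1, 2, \ldots\},$$

where the $\lambda$-parametrized optimal policy $\pi^*_\lambda : \mathcal X \times \mathcal X \to \mathcal A$ and worst case transition
probability $P^*$ form a minimax saddle point of the fixed point solution for $\lambda \ge 0$.² Furthermore,
the following sub-gradient descent algorithm finds the Lagrange multiplier
of problem (6):

1. Find the following Lagrange multiplier update:

$$\lambda^{(j+1)} = \Big( \lambda^{(j)} - \alpha^{(j)} \Big( \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi^*_{\lambda^{(j)}}, M(P)) - \rho(\pi_B, M^*) \Big) \Big)_+,$$

where the step length $\alpha^{(j)}$ is non-summable, square-summable.³

2. Define $f_{\min}^{(j+1)} = \min\big(f_{\min}^{(j)}, f(\lambda^{(j+1)})\big)$ as the best dual function estimate where
$f : \mathbb R_{\ge 0} \to \mathbb R$ is the robust Lagrangian function at $\pi = \pi^*_\lambda$, i.e.,

$$f(\lambda) = \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} L(\pi, \lambda).$$

Update the Lagrange multiplier estimate as follows:

$$\lambda^{*(j+1)} \leftarrow \begin{cases} \lambda^{(j+1)} & \text{if } f_{\min}^{(j+1)} = f(\lambda^{(j+1)}), \\ \lambda^{*(j)} & \text{otherwise.} \end{cases}$$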
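The robust value iteration summarized above can be sketched in a few lines. The example below is a simplification and rests on several assumptions that are not taken from the text: it works on an ordinary state space rather than the augmented space $\mathcal X \times \mathcal X$, it uses a plain reward `r[x][a]` instead of the $\lambda$-parametrized reward, and it solves the inner minimization over an $L_1$ ball by greedily shifting probability mass from high-value next states to the lowest-value one, which is a standard technique for $L_1$-constrained sets. All names are hypothetical.

```python
def worst_case_l1(p_hat, v, eps):
    """Minimize p . v over distributions p with ||p - p_hat||_1 <= eps."""
    p = list(p_hat)
    i_min = min(range(len(v)), key=lambda i: v[i])
    # mass moved onto the lowest-value state (at most eps / 2)
    add = min(eps / 2.0, 1.0 - p[i_min])
    p[i_min] += add
    remaining = add
    # remove the same mass from the highest-value states first
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if remaining <= 0.0:
            break
        if i == i_min:
            continue
        take = min(p[i], remaining)
        p[i] -= take
        remaining -= take
    return p


def robust_value_iteration(P_hat, r, eps, gamma, iters=500):
    """Iterate the robust Bellman operator with (s, a)-rectangular L1 sets."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        new_V = []
        for x in range(n):
            q_values = []
            for a in range(len(r[x])):
                p = worst_case_l1(P_hat[x][a], V, eps[x][a])
                q_values.append(r[x][a] + gamma * sum(pi * vi for pi, vi in zip(p, V)))
            new_V.append(max(q_values))
        V = new_V
    return V
```

Setting every entry of `eps` to zero recovers ordinary value iteration on the estimated model, which gives a quick sanity check: robust values should never exceed the nominal ones.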

2 Note that $\min_{P \in \mathcal U(\widehat P, e)} L(\pi, \lambda)$ is a linear function of $\lambda$ and the worst case minimization is only taken
with respect to the constraint component. Then $P^*$ is independent of $\lambda$.

3 The step-size condition satisfies $\alpha^{(j)} \ge 0$, $\sum_{j=0}^{\infty} \alpha^{(j)} = \infty$ and $\sum_{j=0}^{\infty} (\alpha^{(j)})^2 < \infty$.

The following lemma shows that the solution of the above procedure converges to the
solution of $\min_{\lambda \ge 0} f(\lambda)$.

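Lemma 11. Let $\lambda^*$ be the minimizer of $\min_{\lambda \ge 0} f(\lambda)$. The solution of the projected
sub-gradient descent algorithm converges to $\lambda^*$, i.e., $\lim_{j \to \infty} \lambda^{(j)} = \lambda^*$.

Proof. Since $L(\pi, \lambda)$ is a linear function of $\lambda$ for any given $P$ and $\pi$, we have that
$f(\lambda) = \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} L(\pi, \lambda)$ is a convex function in $\lambda$. By the envelope
theorem of mathematical economics m., for $\lambda \ge 0$ one can write

$$\frac{d f(\lambda)}{d \lambda} = \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi^*_\lambda, M(P)) - \rho(\pi_B, M^*).$$

Now, we show that the proposed sub-gradient descent algorithm converges to the
optimizer of $\min_{\lambda \ge 0} f(\lambda)$. First, one can write the $\lambda$ iterate as $\lambda^{(j+1)} = (\tilde\lambda^{(j+1)})_+$
where

$$\tilde\lambda^{(j+1)} := \lambda^{(j)} - \alpha^{(j)} \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(j)}}.$$

Since the projection operator is non-expansive, one obtains $(\lambda^{(j+1)} - \lambda^*)^2 \le (\tilde\lambda^{(j+1)} - \lambda^*)^2$.
Furthermore, the following expression holds:

$$\begin{aligned}
(\lambda^{(j+1)} - \lambda^*)^2 &\le (\tilde\lambda^{(j+1)} - \lambda^*)^2 \\
&= \Big( \lambda^{(j)} - \alpha^{(j)} \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(j)}} - \lambda^* \Big)^2 \\
&= (\lambda^{(j)} - \lambda^*)^2 - 2 \alpha^{(j)} (\lambda^{(j)} - \lambda^*) \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(j)}} + (\alpha^{(j)})^2 \Big( \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(j)}} \Big)^2 \\
&\le (\lambda^{(j)} - \lambda^*)^2 - 2 \alpha^{(j)} \big( f(\lambda^{(j)}) - f(\lambda^*) \big) + (\alpha^{(j)})^2 \Big( \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(j)}} \Big)^2 .
\end{aligned}$$

The inequality is due to the fact that for any $\lambda \ge 0$, convexity of $f(\lambda)$ implies that

$$f(\lambda) - f(\lambda^*) \le (\lambda - \lambda^*) \frac{d f(\lambda)}{d \lambda}.$$

This further implies

$$(\lambda^{(j+1)} - \lambda^*)^2 \le (\lambda^{(0)} - \lambda^*)^2 - \sum_{q=0}^{j} 2 \alpha^{(q)} \big( f(\lambda^{(q)}) - f(\lambda^*) \big) + \sum_{q=0}^{j} (\alpha^{(q)})^2 \Big( \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(q)}} \Big)^2 .$$

Since $(\lambda^{(j+1)} - \lambda^*)^2$ is a non-negative quantity and $(\lambda^{(0)} - \lambda^*)^2$ is bounded, this further
implies

$$2 \sum_{q=0}^{j} \alpha^{(q)} \big( f(\lambda^{(q)}) - f(\lambda^*) \big) \le (\lambda^{(0)} - \lambda^*)^2 + \sum_{q=0}^{j} (\alpha^{(q)})^2 \Big( \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(q)}} \Big)^2 .$$

By defining $f_{\min}^{(j)} = \min_{q \in \{0, \ldots, j\}} f(\lambda^{(q)})$, the above expression implies

$$f_{\min}^{(j)} - f(\lambda^*) \le \frac{ (\lambda^{(0)} - \lambda^*)^2 + \sum_{q=0}^{j} (\alpha^{(q)})^2 \Big( \left. \frac{d f(\lambda)}{d \lambda} \right|_{\lambda = \lambda^{(q)}} \Big)^2 }{ 2 \sum_{q=0}^{j} \alpha^{(q)} } .$$

The step-size rule of $\alpha^{(q)}$ ensures that the numerator is bounded (the derivative of $f$ is a
difference of bounded returns) and the denominator goes to infinity as $j \to \infty$. This implies
that for any $\epsilon > 0$, there exists a constant $N(\epsilon)$
such that for any $j > N(\epsilon)$, $f(\lambda^*) \le f_{\min}^{(j)} \le f(\lambda^*) + \epsilon$. In other words, the sequence
$\lambda^{(j)}$ converges to the global minimum $\lambda^*$ of $f(\lambda)$. □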
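The projected sub-gradient scheme analyzed in Lemma 11 can be illustrated on a toy convex function. In the sketch below, the dual function is replaced by a piecewise-linear stand-in $f(\lambda) = \max_i (a_i + b_i \lambda)$, since evaluating the actual robust Lagrangian would require solving a robust MDP at every iterate. The step size $\alpha^{(j)} = 1/(j+1)$ is one choice that is non-summable and square-summable; the function names are hypothetical.

```python
def projected_subgradient(f_and_grad, lam0=0.0, iters=1000):
    """Minimize a convex f over lambda >= 0, tracking the best iterate."""
    lam = max(0.0, lam0)
    best_f, best_lam = float("inf"), lam
    for j in range(iters):
        value, grad = f_and_grad(lam)
        if value < best_f:
            best_f, best_lam = value, lam
        step = 1.0 / (j + 1)               # non-summable, square-summable
        lam = max(0.0, lam - step * grad)  # projection onto lambda >= 0
    return best_lam, best_f


def piecewise_linear(pieces):
    """Return f(lam) = max_i (a_i + b_i * lam) together with a sub-gradient."""
    def f_and_grad(lam):
        value, slope = max((a + b * lam, b) for a, b in pieces)
        return value, slope
    return f_and_grad
```

Tracking the best value seen so far mirrors the role of $f_{\min}^{(j)}$ in the proof: individual iterates may oscillate around the minimizer, while the best value is monotonically non-increasing.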

Combining all previous arguments, the saddle point solution of (6) is given by
$(\pi^*_{\lambda^*}, \lambda^*)$.


A.8 Proof of Theorem 7

Proof. Given Assumption 1, if the solution $\min_{\lambda \ge 0} \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda)$
is lower-bounded, then weak duality implies that the primal Lagrangian $\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \mathcal L(\pi, \lambda)$
is also lower-bounded, which further implies that $\pi_{AR}$ is a safe policy. Otherwise,
Algorithm 3 returns the baseline policy $\pi_B$. This concludes that the policy $\pi_{AR}$ returned
by Algorithm 3 is safe.

For the performance loss bound, without loss of generality it is analyzed based
on MDP $M_{\lambda_1, \lambda_2}(P)$ with $\lambda_1 = 1$ and $\lambda_2 = 0$, where $M_{1,0}(P) = M(P)$ and
$M_{1,0}(P^*) = M^*$. The proof follows with arguments identical to those in the proof of
Theorem 5 and is omitted for the sake of brevity. □


A.9 Proof of Proposition [8] 

Proof. Suppose that Algorithm 2 returns a safe policy other than $\pi_B$, i.e., $\pi_R \ne \pi_B$.
Since the class of policies $\Pi_S$ is dominating for the optimization problem (6), we may
write (6) as
max (14) 

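where the feasible policy set $\Pi_S^{AR}$ is defined as

$$\Pi_S^{AR} = \Big\{ \pi \in \Pi_S : \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) \ge \rho(\pi_B, M^*) \Big\}.$$

This leads to the primal Lagrangian formulation $\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda)$.
Since $\pi_R \ne \pi_B$ is a safe policy and the feasible set $\Pi_S^{AR}$ is non-empty, the solution of
the primal Lagrangian is equal to the solution of (6). Furthermore by weak duality,
we have

$$\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda) \le \min_{\lambda \ge 0} \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda),$$

where the right-hand side is the solution of Algorithm 3.

On the other hand, consider the following optimization problem:

$$\max_{\pi \in \Pi_S^{AR}} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)). \qquad (15)$$

When $\pi_R \ne \pi_B$ is a safe policy, the feasible set $\Pi_S^{AR}$ is non-empty, and we may write
the Lagrangian function of (15) as

$$\mathcal L(\pi, \lambda) = \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) + \lambda \Big( \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) - \rho(\pi_B, M^*) \Big).$$

Since the feasible set is non-empty, we may drop the constraint in (15) and this problem
becomes equivalent to

$$\max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)),$$

where the dominating class of policies of this problem is stationary Markovian $\Pi_S$.
Thus the primal Lagrangian duality formulation implies that

$$\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \mathcal L(\pi, \lambda) = \max_{\pi \in \Pi_S^{AR}} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) = \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)),$$

i.e., $\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \mathcal L(\pi, \lambda)$ equals the solution of Algorithm 2. Since the
objective function of (15) is lower than that of (14) and both problems share the same
feasible set, it is obvious that

$$\max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \mathcal L(\pi, \lambda) \le \max_{\pi \in \Pi_S} \min_{\lambda \ge 0} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda).$$

Combining these arguments, we have just shown that if $\pi_R$ is a safe policy, then

$$\rho(\pi_B, M^*) \le \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_R, M(P)) \le \min_{\lambda \ge 0} \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda) \le \min_{P \in \mathcal U(\widehat P, e)} L_P(\pi_{AR}, \lambda^*),$$

where $(\pi_{AR}, \lambda^*)$ is the maximin saddle-point solution of $\min_{P \in \mathcal U(\widehat P, e)} L_P(\pi, \lambda)$. □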


A.10 Proof of Theorem [9] 

Proof. Consider an arbitrary $P \in \mathcal U(\widehat P, e)$. Then from Assumption 1 and the
construction of $\mathcal U(\widehat P, e)$, we have:

$$\begin{aligned}
\|P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 &= \|P(\cdot|x,a) - \widehat P(\cdot|x,a) + \widehat P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 \\
&\le \|P(\cdot|x,a) - \widehat P(\cdot|x,a)\|_1 + \|\widehat P(\cdot|x,a) - P^*(\cdot|x,a)\|_1 \\
&\le 2 e(x, a).
\end{aligned}$$

Then, using Lemma 10 with the above difference between $P$ and $P^*$, we get for any
policy $\pi$ that:

$$\max_{P \in \mathcal U(\widehat P, e)} \big| \rho(\pi, P) - \rho(\pi, P^*) \big| \le \frac{2 \gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_\pi)^{-1} e_\pi = \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u^*_\pi}, \qquad (16)$$

where $u^*_\pi$ is the normalized state occupancy frequency for policy $\pi$ defined as:

$$u^*_\pi = (1 - \gamma) \big( I - \gamma (P^*_\pi)^\top \big)^{-1} p_0 .$$

To prove the safety of $\pi_1$, note that the objective in (7) is always non-negative since
$\pi_B$ is feasible. Then we get the safety condition by simple algebraic manipulation as
follows:

$$\min_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi_1, P) - \rho(\pi_B, P) \big) \ge 0 \quad \Longrightarrow \quad \rho(\pi_1, P^*) \ge \rho(\pi_B, P^*).$$

The safety of the policy $\pi_1$ also implies that its performance loss is bounded by the
performance loss of the base policy:

$$\rho(\pi^*, P^*) - \rho(\pi_1, P^*) \le \rho(\pi^*, P^*) - \rho(\pi_B, P^*). \qquad (17)$$

Now we are ready to show a bound on the performance loss of $\pi_1$ by lower bounding
$\rho(\pi_1, P^*)$ as follows:

$$\rho(\pi_1, P^*) = \rho(\pi_1, P^*) - \rho(\pi_B, P^*) + \rho(\pi_B, P^*) \ge \min_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi_1, P) - \rho(\pi_B, P) \big) + \rho(\pi_B, P^*).$$

From the optimality of $\pi_1$, we further get:

$$\begin{aligned}
\min_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi_1, P) - \rho(\pi_B, P) \big) &\ge \min_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi^*, P) - \rho(\pi_B, P) \big) \\
&\ge \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi^*, P) - \max_{P \in \mathcal U(\widehat P, e)} \rho(\pi_B, P).
\end{aligned}$$

Putting the above together, and subtracting and adding $\rho(\pi^*, P^*)$, we get:

$$\rho(\pi^*, P^*) - \rho(\pi_1, P^*) \le \max_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi^*, P^*) - \rho(\pi^*, P) \big) + \max_{P \in \mathcal U(\widehat P, e)} \big( \rho(\pi_B, P) - \rho(\pi_B, P^*) \big).$$

The bound in the theorem then follows by bounding the maximization terms above
using (16) and combining the above inequality with (17).

Figure 4 depicts an example demonstrating the tightness of the bound. The initial
state is $s_0$, actions are $a_1$, $a_2$, and the transitions are deterministic. $\xi^*$ denotes the true
transitions, $\xi^1$ denotes the worst case in $\mathcal U(\widehat P, e)$, and the leaves show the returns of the
remainder of the MDP and are assumed to be known with certainty. The value for $e$
is given by © . That is, if $\xi$ represents the sampled transition probability, it would be
halfway between $\xi^*$ and $\xi^1$. □

Figure 4: Example showing tightness of bound in Theorem 9.


B Alternative Bounds 

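B.1 An Alternative Bound on the Performance Loss of $\pi_{Ra}$

Corollary 12. Given Assumption 1, the performance loss $\Phi(\pi_{Ra})$ also satisfies

$$\Phi(\pi_{Ra}) \le \min\Big\{ \frac{\mathrm{BR}(\pi_{Ra})}{1-\gamma} + \max_{\pi \in \Pi_S} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u^*_\pi},\; \Phi(\pi_B) \Big\},$$

where $\mathrm{BR}(\pi_{Ra}) = \max_{x \in \mathcal X} \big| \widetilde T[V^{\pi_{Ra}}](x) - V^{\pi_{Ra}}(x) \big|$ is the Bellman residual w.r.t.
Bellman operator $\widetilde T[V](x) = \max_{a \in \mathcal A} \big\{ \tilde r(x, a) + \gamma \sum_{x' \in \mathcal X} \widehat P(x'|x, a) V(x') \big\}$, value
function $V^{\pi_{Ra}}(x) = \rho(\pi_{Ra}, \widetilde M)$ at $x_0 = x$ and $u^*_\pi$ is the normalized state occupancy
frequency of the policy $\pi$.

Proof. Similar to the proof of Theorem 4, we notice that

$$\rho(\pi^*_{M^*}, M^*) - \rho(\pi_0, \widetilde M) = \underbrace{\rho(\pi^*_{M^*}, M^*) - \max_{\pi \in \Pi_S} \rho(\pi, \widetilde M)}_{(a)} + \underbrace{\max_{\pi \in \Pi_S} \rho(\pi, \widetilde M) - \rho(\pi_0, \widetilde M)}_{(b)}. \qquad (18)$$

First we prove an upper bound for (b). Recall the Bellman operator $\widetilde T : \mathbb R^{|\mathcal X|} \to \mathbb R^{|\mathcal X|}$,
defined for a value function $V$ as $\widetilde T[V](y) = \max_{a \in \mathcal A} \big\{ \tilde r(y, a) + \gamma \sum_{y' \in \mathcal X} \widehat P(y'|y, a) V(y') \big\}$.
Also define the value function $V^{\pi_0}(x)$ as $\rho(\pi_0, \widetilde M)$ and the optimal value function
$V(x)$ as $\max_{\pi \in \Pi_S} \rho(\pi, \widetilde M)$, when the initial state is $x_0 = x$. By applying the contraction
mapping property on $\widetilde T[V^{\pi_0}](y) - V^{\pi_0}(y)$ for any $y \in \mathcal X$ and combining with the
definition of Bellman residual, one obtains

$$\widetilde T^2[V^{\pi_0}](y) - \widetilde T[V^{\pi_0}](y) \le \gamma\, \mathrm{BR}(\pi_0).$$

By an induction argument, the above expression becomes

$$\widetilde T^N[V^{\pi_0}](y) - \widetilde T^{N-1}[V^{\pi_0}](y) \le \gamma^{N-1}\, \mathrm{BR}(\pi_0),$$

for which by a telescoping sum, it further implies

$$\widetilde T^N[V^{\pi_0}](y) - V^{\pi_0}(y) = \sum_{k=1}^{N} \Big( \widetilde T^k[V^{\pi_0}](y) - \widetilde T^{k-1}[V^{\pi_0}](y) \Big) \le \sum_{k=1}^{N} \gamma^{k-1}\, \mathrm{BR}(\pi_0).$$

By letting $N \to \infty$ and noticing that $\lim_{N \to \infty} \widetilde T^N[V^{\pi_0}](y) = V(y)$, one finally obtains

$$(b) = \max_{\pi \in \Pi_S} \rho(\pi, \widetilde M) - \rho(\pi_0, \widetilde M) = \sum_{y \in \mathcal X} p_0(y) \big( V(y) - V^{\pi_0}(y) \big) \le \lim_{N \to \infty} \sum_{k=1}^{N} \gamma^{k-1}\, \mathrm{BR}(\pi_0) = \frac{\mathrm{BR}(\pi_0)}{1-\gamma}. \qquad (19)$$

For an upper bound in (a), by interchanging $M^*$ and $\widetilde M$ in the derivation of (10) and
taking maximum on both sides, we also have,

$$(a) = \max_{\pi \in \Pi_S} \rho(\pi, M^*) - \max_{\pi \in \Pi_S} \rho(\pi, \widetilde M) \le \max_{\pi \in \Pi_S} \frac{2 \gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P^*_\pi)^{-1} e_\pi \le \max_{\pi \in \Pi_S} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u^*_\pi}.$$

Then the proof is completed by combining both parts of the above arguments. □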
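The Bellman-residual step in (19) says that the gap between the optimal value and the value of a fixed policy is at most $\mathrm{BR}/(1-\gamma)$. A small numerical check of this relation is sketched below for an ordinary (non-robust) MDP given as nested lists; the data layout `P[x][a][y]`, `r[x][a]` and the helper names are assumptions of the example.

```python
def bellman_backup(P, r, V, gamma):
    """One application of the optimal Bellman operator T[V]."""
    n = len(r)
    return [
        max(r[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(n))
            for a in range(len(r[x])))
        for x in range(n)
    ]


def policy_value(P, r, policy, gamma, iters=2000):
    """Value of a deterministic policy by fixed-point iteration."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[x][policy[x]] + gamma * sum(P[x][policy[x]][y] * V[y] for y in range(n))
             for x in range(n)]
    return V


def residual_gap(P, r, policy, gamma, iters=2000):
    """Return (max_x V*(x) - V^pi(x), BR(pi) / (1 - gamma))."""
    V_pi = policy_value(P, r, policy, gamma, iters)
    TV = bellman_backup(P, r, V_pi, gamma)
    residual = max(abs(t - v) for t, v in zip(TV, V_pi))
    V_opt = [0.0] * len(r)
    for _ in range(iters):
        V_opt = bellman_backup(P, r, V_opt, gamma)
    gap = max(o - v for o, v in zip(V_opt, V_pi))
    return gap, residual / (1.0 - gamma)
```

The bound can be loose: for a policy that is one improvement step away from optimal, the residual already equals the full gap, and the factor $1/(1-\gamma)$ is pure slack.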


B.2 An Alternative Bound on the Performance Loss of $\pi_R$

Corollary 13. Given Assumption 1, the performance loss $\Phi(\pi_R)$ also satisfies

$$\Phi(\pi_R) \le \min\Big\{ \frac{\mathrm{BR}(\pi_R)}{1-\gamma} + \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u_\pi^{M(P)}},\; \Phi(\pi_B) \Big\},$$

where $\mathrm{BR}(\pi_R) = \max_{x \in \mathcal X} \big| T[V^{\pi_R}](x) - V^{\pi_R}(x) \big|$ is the Bellman residual w.r.t. Bellman
operator $T[V](x) = \max_{a \in \mathcal A} \big\{ r(x, a) + \gamma \min_{P \in \mathcal U(\widehat P, e)} \sum_{x' \in \mathcal X} P(x'|x, a) V(x') \big\}$,
value function $V^{\pi_R}(x) = \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_R, M(P))$ at $x_0 = x$ and $u_\pi^{M(P)}$ is the
normalized state occupancy frequency of the policy $\pi$.

Proof. For the proof of the alternative performance bound, by following the same
analysis as in the derivation of (18), replacing the Bellman operator $\widetilde T$ with the robust
Bellman operator:

$$T[V](y) = \max_{a \in \mathcal A} \Big\{ r(y, a) + \gamma \min_{P \in \mathcal U(\widehat P, e)} \sum_{y' \in \mathcal X} P(y'|y, a) V(y') \Big\},$$

and defining the robust value function $V^{\pi_0}(x)$ as $\min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P))$ when the
initial state is $x_0 = x$, we can easily show that

$$\max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) - \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_0, M(P)) \le \frac{\mathrm{BR}(\pi_0)}{1-\gamma},$$

and

$$\begin{aligned}
\max_{\pi \in \Pi_S} \rho(\pi, M^*) - \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi, M(P)) &\le \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \frac{2 \gamma R_{\max}}{1-\gamma}\, p_0^\top (I - \gamma P_\pi)^{-1} e_\pi \\
&\le \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u_\pi^{M(P)}} .
\end{aligned}$$

The proof is then completed by combining both of the above results. □


B.3 An Alternative Bound on the Performance Loss of $\pi_{AR}$

Corollary 14. Given Assumption 1, the performance loss $\Phi(\pi_{AR})$ also satisfies

$$\Phi(\pi_{AR}) \le \min\Big\{ \frac{\mathrm{BR}(\pi_{AR})}{1-\gamma} + \max_{\pi \in \Pi_S} \min_{P \in \mathcal U(\widehat P, e)} \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \|e_\pi\|_{1, u_\pi^{M(P)}},\; \Phi(\pi_B) \Big\},$$

where $u_\pi^{M(P)}$ is the normalized state occupancy frequency of the policy $\pi$ and the
Bellman residual for robust MDPs is $\mathrm{BR}(\pi_{AR}) = \mathrm{BR}_{1,0}(\pi_{AR})$. In the generic
case $(\lambda_1, \lambda_2) \ge 0$, the Bellman residual w.r.t. Bellman operator

$$T_{\lambda_1, \lambda_2}[V](x, y) = \max_{a \in \mathcal A} \Big\{ r_{\lambda_1, \lambda_2}(x, y, a) + \gamma \min_{P \in \mathcal U(\widehat P, e)} \sum_{x', y' \in \mathcal X} P(x', y'|x, y, a) V(x', y') \Big\}$$

is given by $\mathrm{BR}_{\lambda_1, \lambda_2}(\pi_{AR}) = \max_{x, y \in \mathcal X} \big| T_{\lambda_1, \lambda_2}[V^{\pi_{AR}}_{\lambda_1, \lambda_2}](x, y) - V^{\pi_{AR}}_{\lambda_1, \lambda_2}(x, y) \big|$, where
$V^{\pi_{AR}}_{\lambda_1, \lambda_2}(x, y) = \min_{P \in \mathcal U(\widehat P, e)} \rho(\pi_{AR}, M_{\lambda_1, \lambda_2}(P))$ is the value function at $s_0 = x$.

Proof. Similar to the proof of Theorem 7, the proof of this corollary follows arguments
identical to those of Corollary 13 and is omitted for the sake of brevity. □